Abstract — In this paper we explore noise tolerant learning of
classifiers. We formulate the problem as follows. We assume that 
there is an unobservable training set which is noise-free. The 
actual training set given to the learning algorithm is obtained 
from this ideal data set by corrupting the class label of each 
example. The probability that the class label of an example is 
corrupted is a function of the feature vector of the example. 
This would account for most kinds of noisy data one encounters 
in practice. We say that a learning method is noise tolerant 
if the classifiers learnt with the ideal noise-free data and with 
noisy data, both have the same classification accuracy on the 
noise-free data. In this paper we analyze the noise tolerance 
properties of risk minimization (under different loss functions), 
which is a generic method for learning classifiers. We show 
that risk minimization under 0-1 loss function has impressive 
noise tolerance properties and that under squared error loss is 
tolerant only to uniform noise; risk minimization under other 
loss functions is not noise tolerant. We conclude the paper with 
some discussion on implications of these theoretical results. 

Index Terms — Risk minimization, noise tolerance, label noise, 
loss functions. 



I. Introduction 

In most situations of learning a classifier, one has to contend 
with noisy examples. Essentially, when training examples 
are noisy, the class labels of examples as provided in the 
training set may not be 'correct'. Such noise can come through 
many sources. If the class conditional densities overlap, then
the same feature vector can come from different classes with
different probabilities and this can be one source of noise
[1], [2]. In addition, in many applications (e.g., document
classification), training examples are obtained through
manual classification and there will be inevitable human errors 
and biases. Noise in training data can come about by errors 
in feature measurements also. Errors in feature values would 
imply that the observed feature vector is at a different point in 
the feature space though the label remains the same and hence 
it can also be looked at as a noise corruption of the class label. 
Hence it is always desirable to have classifier design strategies 
that are robust to noise in training data. 

A popular methodology in classifier learning is (empirical) 
risk minimization. Under this, one chooses a convenient loss 
function and the goal of learning is to find a classifier that min- 
imizes risk which is expectation of the loss. The expectation 
is with respect to the underlying probability distribution over 
the feature space. In case of noisy samples, this expectation 
would include averaging with respect to noise also. 

In this paper, we study noise tolerance properties of risk 
minimization under different loss functions such as 0-1 loss,
squared error loss, exponential loss, hinge loss etc. We con-
sider what we call non-uniform noise where the probability of 
the class label for an example being incorrect, is a function 
of the feature vector of the example. This is a very general 
noise model and can account for most cases of noisy datasets. 
We say that risk minimization (with a loss function) is noise 
tolerant if the minimizers under the noise-free and noisy 
cases have the same probability of mis-classification on noise- 
free datasets. We present some analysis to characterize noise 
tolerance of risk minimization with different loss functions. 

As we show here, the 0-1 loss function has very interesting 
noise tolerance properties. In general, risk minimization under
0-1 loss is desirable because it achieves least probability of 
mis-classification. However, the optimization problem here is 
computationally hard. To overcome this, many of the classifier 
learning strategies use some convex surrogates of the 0-1 loss 
function (e.g., hinge loss, square loss etc.). The convexity of 
the resulting optimization problems makes these approaches 
computationally efficient. There have been statistical analyses 
of such methods so that one can bound risk under 0-1 loss, of 
the classifier obtained as a minimizer of risk under some other 
convex loss [3]. The analysis we present here is completely
different because the objective is to understand noise tolerance 
properties of risk minimization. Here we are interested in 
comparing minimizers of risk under the same loss function 
but under different noise conditions. 

The rest of the paper is organized as follows. In section 2 
we discuss the concept of noise tolerant learning of classifiers. 
In section 3, we present our results regarding noise tolerance 
of different loss functions. We present a few simulation results 
to support our analysis in section 4 and conclude in section 5. 

II. Noise Tolerant Learning 

When we have to learn with noisy data where class labels 
may be corrupted, we want approaches that are robust to label 
noise. Most of the standard classifiers (e.g. support vector 
machine, adaboost etc.) perform well only under noise-free 
training data; when there is label noise, they tend to over-fit. 

There are many approaches to tackle label noise in training
data. Outlier detection [4], restoration of clean labels for noisy
points [5] and restricting the effects of noisy points on the
classifier [6], [7] are some of the well known tricks to tackle
label noise. However, all these are mostly heuristic and also
need extra computation. Many of them also assume uniform
noise and sometimes assume knowledge of noise variance.

A different approach would be to look for methods that are 
inherently noise tolerant. That is, the algorithm will handle the 
noisy data the same way that it would handle noise-free data. 
However due to some property of the algorithm, its output 
would be the same whether the input is noise-free or noisy data.
Noise tolerant learning using statistical queries [8] is one
such approach. The algorithm learns by using some statistical 
quantities computed from the examples. That is the reason for 
its noise tolerance properties. However, the approach is mostly 
limited to binary features. Also, the appropriate statistical 
quantities to be computed depend on the type of noise and
the type of classifier being learned. 

In this paper, we investigate the noise tolerance properties 
of the general risk minimization strategy. We formulate our 
concept of noise tolerance as explained below. For simplicity, 
we consider only the two class classification problem. 

We assume that there exists an ideal noise-free sample
which is unobservable but where the class label given to
each example is correct. We represent this ideal sample by
$\{(\mathbf{x}_i, y_{\mathbf{x}_i}),\ i = 1, \ldots, N\}$, where $\mathbf{x}_i \in \mathbb{R}^d$, $y_{\mathbf{x}_i} \in \{-1, 1\}$, $\forall i$.

The actual training data given to the learning algorithm is
obtained by corrupting these (ideal) noise-free examples by
changing the class label on each example. The actual training
data set would be $\{(\mathbf{x}_i, \hat{y}_{\mathbf{x}_i}),\ i = 1, \ldots, N\}$, where $\hat{y}_{\mathbf{x}_i} = y_{\mathbf{x}_i}$
with probability $(1 - \eta_{\mathbf{x}_i})$ and $\hat{y}_{\mathbf{x}_i} = -y_{\mathbf{x}_i}$ with probability
$\eta_{\mathbf{x}_i}$, $\forall i$. If $\eta_{\mathbf{x}_i} = \eta$, $\forall \mathbf{x}_i$, then we say that the noise is uniform.
Otherwise, we say the noise is non-uniform.

We note here that under non-uniform classification noise, 
the probability of the class label being wrong can be different 
for different examples. We assume, throughout this paper, that 
$\eta_{\mathbf{x}} < 0.5$, $\forall \mathbf{x}$, which is reasonable.
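The noise model above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the helper `corrupt_labels`, the synthetic 1-D sample, and the two rate functions are hypothetical and not part of the original formulation. Each clean label is flipped independently with a probability that may depend on the feature value.

```python
import random

def corrupt_labels(xs, ys, noise_rate, rng):
    """Return noisy labels: each y in {-1, +1} is flipped with probability noise_rate(x)."""
    noisy = []
    for x, y in zip(xs, ys):
        eta = noise_rate(x)
        if not 0.0 <= eta < 0.5:
            raise ValueError("noise rate must lie in [0, 0.5)")
        noisy.append(-y if rng.random() < eta else y)
    return noisy

# Hypothetical 1-D sample; the clean label is the sign of x (an assumption).
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
ys = [1 if x >= 0 else -1 for x in xs]

def uniform_rate(x):
    return 0.2                        # same flip probability for every x

def non_uniform_rate(x):
    return 0.4 if x >= 0 else 0.1     # flip probability depends on x

noisy_uniform = corrupt_labels(xs, ys, uniform_rate, rng)
noisy_non_uniform = corrupt_labels(xs, ys, non_uniform_rate, rng)
```

The check `0 <= eta < 0.5` mirrors the assumption on the noise rate made throughout the paper.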

As a notation, we assume that the risk is defined over a class
of functions, $f$, that map the feature space to real numbers. This
allows us to treat all loss functions through a single notation.
We call any such $f$ a classifier and the class label assigned by
it to a feature vector $\mathbf{x}$ would be $\mathrm{sign}(f(\mathbf{x}))$.

Let $L(\cdot, \cdot)$ be a specific loss function. For any classifier $f$,
the risk under no-noise case is

$R(f) = E[L(f(\mathbf{x}), y_{\mathbf{x}})]$

where the expectation is with
respect to the underlying distribution of the feature vector $\mathbf{x}$.
Let $f^*$ be the minimizer of $R(f)$.

Under the noisy case, the risk of any classifier $f$ is

$R^{\eta}(f) = E[L(f(\mathbf{x}), \hat{y}_{\mathbf{x}})]$

Note that the noisy label $\hat{y}_{\mathbf{x}}$ has additional randomness due to noise corruption
of labels and the expectation includes averaging with respect
to that also. To emphasize this, we use the notation $R^{\eta}$ to
denote risk under noisy case. Let $f^*_{\eta}$ be the minimizer of $R^{\eta}$.

Definition 1: Risk minimization under loss function $L$
is said to be noise-tolerant if $P[\mathrm{sign}(f^*(\mathbf{x})) = y_{\mathbf{x}}] =
P[\mathrm{sign}(f^*_{\eta}(\mathbf{x})) = y_{\mathbf{x}}]$, where the probability is w.r.t. the
underlying distribution of $(\mathbf{x}, y_{\mathbf{x}})$.

That is, the general learning strategy of risk minimization 
under a given loss function, is said to be noise-tolerant if the 
classifier it would learn with the noisy training data has the 
same probability of misclassification as that of the classifier 
the algorithm would learn if it is given ideal or noise-free 
class labels for all training data. Noise tolerance can be 
achieved even when $f^* \neq f^*_{\eta}$ because we are only comparing
the probability of mis-classification of $f^*$ and $f^*_{\eta}$. However,
$f^* = f^*_{\eta}$ is a sufficient condition for noise tolerance.



Thinking of an ideal noise-free sample allows us to properly 
formulate the noise-tolerance property as above. We note once 
again that this noise-free sample is assumed to be unobservable.
Making the probability of label corruption, $\eta_{\mathbf{x}}$, a
function of x would take care of most cases of noisy data. For 
example, consider a 2-class problem with overlapping class 
conditional densities where the training data are generated by 
sampling from the respective class conditional densities. Then 
we can think of the unobservable noise-free dataset to be the 
one obtained by classifying the examples using Bayes optimal 
classifier. The labels given in the actual training dataset would 
not agree with the ideal labels (because of overlapping class 
conditional densities); however, the observed labels are easily 
seen to be noisy versions where the noise probability is a 
function of the feature vector. If there are any further sources 
of noise in generating the dataset given to the algorithm, these 
can also be easily accounted for by $\eta_{\mathbf{x}}$ because the probability
of wrong label for different examples can be different. 

III. Noise Tolerance of Risk Minimization 

In this section, we analyze noise tolerance property of risk 
minimization with respect to different loss functions. 

A. 0-1 Loss Function 
The 0-1 loss function is

$L_{0\text{-}1}(f(\mathbf{x}), y_{\mathbf{x}}) = I_{\{\mathrm{sign}(f(\mathbf{x})) \neq y_{\mathbf{x}}\}}$

where $I_A$ denotes the indicator of event $A$.

Theorem 1: Assume $\eta_{\mathbf{x}} < 0.5$, $\forall \mathbf{x}$. Then, (i) risk minimization
with 0-1 loss function is noise tolerant under uniform
noise; (ii) in case of non-uniform noise, risk minimization
with 0-1 loss function is noise tolerant if $R(f^*) = 0$.

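Proof: For any $f$, let $S(f) := \{\mathbf{x} \mid \mathrm{sign}(f(\mathbf{x})) \neq y_{\mathbf{x}}\}$
and $S^c(f) := \{\mathbf{x} \mid \mathrm{sign}(f(\mathbf{x})) = y_{\mathbf{x}}\}$. The risk for a function
$f$ under no-noise case is

$$R(f) = E\left[I_{\{\mathrm{sign}(f(\mathbf{x})) \neq y_{\mathbf{x}}\}}\right] = \int_{S(f)} dp(\mathbf{x})$$

where $dp(\mathbf{x})$ denotes that the above integral is an expectation
integral with respect to the distribution of feature vectors. Recall
that $f^*$ is the minimizer of $R$. Let $A(\mathbf{x}) := I_{\{\mathrm{sign}(f(\mathbf{x})) \neq y_{\mathbf{x}}\}}$.
Then risk for any $f$ in presence of noise would be

$$R^{\eta}(f) = E\left[(1 - \eta_{\mathbf{x}}) A(\mathbf{x}) + \eta_{\mathbf{x}} (1 - A(\mathbf{x}))\right] = \int_{\mathbb{R}^d} \eta_{\mathbf{x}}\, dp(\mathbf{x}) + \int_{S(f)} (1 - 2\eta_{\mathbf{x}})\, dp(\mathbf{x}) \qquad (1)$$

Given any $f \neq f^*$, we have

$$R^{\eta}(f^*) - R^{\eta}(f) = \int_{S(f^*)} (1 - 2\eta_{\mathbf{x}})\, dp(\mathbf{x}) - \int_{S(f)} (1 - 2\eta_{\mathbf{x}})\, dp(\mathbf{x}) \qquad (2)$$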
For the first part of the theorem, we consider uniform noise
and hence $\eta_{\mathbf{x}} = \eta$, $\forall \mathbf{x}$. From Eq. (1), we now get, for any
$f$, $R^{\eta}(f) = \eta + (1 - 2\eta) R(f)$. Hence we have $R^{\eta}(f) -
R^{\eta}(f^*) = (1 - 2\eta)(R(f) - R(f^*))$. Since $f^*$ is the minimizer of
$R$, we have $R(f) \geq R(f^*)$, $\forall f \neq f^*$, which implies $R^{\eta}(f) \geq
R^{\eta}(f^*)$, $\forall f \neq f^*$ if $\eta < 0.5$. Thus under uniform label noise,
$f^*$ also minimizes $R^{\eta}$. This completes the proof of the first part of
the theorem. (The fact that risk minimization under 0-1 loss
function is tolerant to uniform noise was known earlier; see, e.g.,
[9, chap. 4].)

Fig. 1. Data for Example 1. Learning best linear classifier when the actual
classification boundary is parabolic

For the second part of the theorem, $\eta_{\mathbf{x}}$ is no longer constant
but we assume $R(f^*) = 0$. This implies $\int_{S(f^*)} dp(\mathbf{x}) = 0$ and
hence from Eq. (2), we get $R^{\eta}(f^*) - R^{\eta}(f) \leq 0$ if $\eta_{\mathbf{x}} < 0.5$, $\forall \mathbf{x}$.
Thus $f^*$, which is the minimizer of $R$, also minimizes $R^{\eta}$. This
shows that risk minimization with 0-1 loss function is noise-
tolerant under non-uniform noise if $R(f^*) = 0$ and completes the
proof of the theorem. ■
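The uniform-noise identity used in the first part of the proof, $R^{\eta}(f) = \eta + (1 - 2\eta) R(f)$, can be checked numerically on a small finite sample. The sketch below is illustrative only: the labels and the two candidate classifiers `f_a` and `f_b` are made-up data, and the distribution is assumed to be uniform on the sample.

```python
def risk_01(preds, labels):
    """Noise-free 0-1 risk under the uniform distribution on the sample."""
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)

def noisy_risk_01(preds, labels, etas):
    """Expected 0-1 risk when label i is flipped with probability etas[i]."""
    total = 0.0
    for p, y, eta in zip(preds, labels, etas):
        wrong = 1.0 if p != y else 0.0      # A(x) in the proof
        total += (1.0 - eta) * wrong + eta * (1.0 - wrong)
    return total / len(labels)

labels = [1, 1, 1, -1, -1, -1, -1, 1]       # hypothetical clean labels
f_a = [1, 1, 1, -1, -1, -1, -1, -1]         # predictions with one mistake
f_b = [1, -1, -1, -1, -1, 1, -1, 1]         # predictions with three mistakes
eta = 0.3
etas = [eta] * len(labels)                  # uniform noise

for preds in (f_a, f_b):
    lhs = noisy_risk_01(preds, labels, etas)
    rhs = eta + (1 - 2 * eta) * risk_01(preds, labels)
    print(round(lhs, 6), round(rhs, 6))
```

Because the map from $R(f)$ to $R^{\eta}(f)$ is increasing when $\eta < 0.5$, the ordering of the two classifiers is the same with and without noise.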

If $R(f^*) \neq 0$, then risk minimization is, in general, not
noise tolerant under non-uniform noise. We show this by a 
counter example. 

Example 1: Fig. 1 shows a binary classification problem
where examples are generated using the true classifier
$f^*_{\mathrm{true}}(\mathbf{x}) = \mathrm{sign}(x_1^2 + x_2)$. Let the probability distribution on
the feature space be uniformly concentrated on the training 
dataset. We note here that we get perfect classification if we 
consider quadratic classifiers. Since we want to consider the 
case where R(f*) > 0, we restrict the family of classifiers 
over which risk is minimized to linear classifiers. 

(a) Without Noise: The linear classifier which minimizes $R$ is
$f^*_{\mathrm{lin}}(\mathbf{x}) = x_2 + 5$ and $S(f^*_{\mathrm{lin}}) = \{\mathbf{x}_9, \mathbf{x}_{10}\}$.

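(b) With Noise: We now introduce non-uniform label noise
in the data with the noise rates as follows: $\eta_{\mathbf{x}_9} = 0.125$,
$\eta_{\mathbf{x}_3} = 0.4$, $\eta_{\mathbf{x}_5} = 0.4$, $\eta_{\mathbf{x}_7} = 0.4$ and any noise rate (less than
0.5) for the rest of the points. Consider another linear classifier
$f^{1}_{\mathrm{lin}}(\mathbf{x}) = 15.5 x_1 + 8 x_2 + 10$. From Fig. 1 we see that
$S(f^{1}_{\mathrm{lin}}) = \{\mathbf{x}_3, \mathbf{x}_5, \mathbf{x}_7, \mathbf{x}_{10}\}$. Now using Eq. (2), and noting that the
terms for $\mathbf{x}_{10}$ (misclassified by both classifiers) cancel, we get

$$R^{\eta}(f^*_{\mathrm{lin}}) - R^{\eta}(f^{1}_{\mathrm{lin}}) = \frac{(1 - 2\eta_{\mathbf{x}_9}) - (1 - 2\eta_{\mathbf{x}_3}) - (1 - 2\eta_{\mathbf{x}_5}) - (1 - 2\eta_{\mathbf{x}_7})}{16} = \frac{(1 - 2 \times 0.125) - 3 \times (1 - 2 \times 0.4)}{16} = \frac{0.15}{16} > 0.$$

That is, $R^{\eta}(f^*_{\mathrm{lin}}) > R^{\eta}(f^{1}_{\mathrm{lin}})$, although $R(f^*_{\mathrm{lin}}) < R(f^{1}_{\mathrm{lin}})$.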

This example proves that risk minimization with 0-1 loss 
is, in general, not noise tolerant if $R(f^*) \neq 0$.
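The arithmetic in Example 1 can be reproduced with a few lines of Python. This is a sketch under assumptions: the helper `noisy_risk_gap` is hypothetical, the points are indexed 1 to 16 with probability 1/16 each, and the rate 0.25 assigned to the unspecified points is an arbitrary placeholder below 0.5.

```python
def noisy_risk_gap(s_star, s_other, eta, n_points):
    """R^eta(f*) - R^eta(f) from Eq. (2) for a uniform distribution on n_points.

    s_star and s_other are the index sets of points misclassified by f* and f.
    """
    def mass(s):
        return sum(1.0 - 2.0 * eta[i] for i in s)
    return (mass(s_star) - mass(s_other)) / n_points

# Placeholder rate (< 0.5) for points whose rate the example leaves free.
eta = {i: 0.25 for i in range(1, 17)}
eta.update({9: 0.125, 3: 0.4, 5: 0.4, 7: 0.4})

# f*_lin misclassifies {x9, x10}; the other linear classifier {x3, x5, x7, x10}.
gap = noisy_risk_gap({9, 10}, {3, 5, 7, 10}, eta, 16)
print(gap)   # positive: the noise-free minimizer has the larger noisy risk
```

The contribution of the tenth point appears in both sets and cancels, so the result does not depend on its noise rate.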

Remark 1: We note here that the assumption $R(f^*) = 0$
may not be very restrictive, mainly because the noise-free
ideal data set is only a mathematical entity and need not be
observable. For example, we can take $f^*$ to be the Bayes
optimal classifier and assume that the ideal data set is obtained
by classifying samples using the Bayes optimal classifier. Then
$R(f^*) = 0$. This means that if we minimize risk under 0-1
loss function with the actual training set, then the minimizer
would be $f^*$.

Finally, we note that all the above analysis is applicable 
to empirical risk minimization also by simply taking p(x) (in 
Eq. (1) and (2)) to be the empirical distribution.

While, as shown here, 0-1 loss function has impressive noise 
tolerant properties, risk minimization with this loss is difficult 
because it is a non-convex optimization problem. In machine 
learning, many other loss functions are used to make risk 
minimization computationally efficient. We will now examine 
the noise tolerance properties of other loss functions. 

B. Square Loss Function 

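Square loss function is given by

$L_{\mathrm{square}}(f(\mathbf{x}), y_{\mathbf{x}}) = (f(\mathbf{x}) - y_{\mathbf{x}})^2$

We first consider the case when the function $f(\mathbf{x})$ is an affine
function of $\mathbf{x}$. Let $f(\mathbf{x}) = \mathbf{x}^T \mathbf{w} + b = \tilde{\mathbf{x}}^T \tilde{\mathbf{w}}$, where $\tilde{\mathbf{w}} =
[\mathbf{w}^T\ b]^T \in \mathbb{R}^{d+1}$ and $\tilde{\mathbf{x}} = [\mathbf{x}^T\ 1]^T \in \mathbb{R}^{d+1}$.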

Theorem 2: Risk minimization with squared error loss 
function for finding linear classifiers is noise tolerant under 
uniform noise if $\eta_{\mathbf{x}} = \eta < 0.5$.

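Proof: For noise-free case, the risk is
$R(\tilde{\mathbf{w}}) = E[(\tilde{\mathbf{x}}^T \tilde{\mathbf{w}} - y_{\mathbf{x}})^2]$, whose minimizer is
$\tilde{\mathbf{w}}^* = [E[\tilde{\mathbf{x}} \tilde{\mathbf{x}}^T]]^{-1} E[\tilde{\mathbf{x}} y_{\mathbf{x}}]$. Risk under uniform label
noise ($\eta_{\mathbf{x}} = \eta$, $\forall \mathbf{x}$) is given as

$$R^{\eta}(\tilde{\mathbf{w}}) = E_{\mathbf{x}} E_{\hat{y}_{\mathbf{x}} \mid \mathbf{x}}\left[(\tilde{\mathbf{x}}^T \tilde{\mathbf{w}} - \hat{y}_{\mathbf{x}})^2 \mid \mathbf{x}\right]
= (1 - \eta) E_{\mathbf{x}}\left[(\tilde{\mathbf{x}}^T \tilde{\mathbf{w}} - y_{\mathbf{x}})^2\right] + \eta E_{\mathbf{x}}\left[(\tilde{\mathbf{x}}^T \tilde{\mathbf{w}} + y_{\mathbf{x}})^2\right]$$

which is minimized by

$$\tilde{\mathbf{w}}^*_{\eta} = (1 - 2\eta) \left[E_{\mathbf{x}}[\tilde{\mathbf{x}} \tilde{\mathbf{x}}^T]\right]^{-1} E_{\mathbf{x}}[\tilde{\mathbf{x}} y_{\mathbf{x}}] = (1 - 2\eta) \tilde{\mathbf{w}}^*$$

Since we assume $\eta < 0.5$, we have $(1 - 2\eta) > 0$. Hence
we get $\mathrm{sign}(\tilde{\mathbf{x}}^T \tilde{\mathbf{w}}^*_{\eta}) = \mathrm{sign}(\tilde{\mathbf{x}}^T \tilde{\mathbf{w}}^*)$, $\forall \mathbf{x}$, which means
$P[\mathrm{sign}(\tilde{\mathbf{x}}^T \tilde{\mathbf{w}}^*_{\eta}) = y_{\mathbf{x}}] = P[\mathrm{sign}(\tilde{\mathbf{x}}^T \tilde{\mathbf{w}}^*) = y_{\mathbf{x}}]$. Thus under
uniform noise, the least squares approach to learning linear classifiers
is noise tolerant and the proof of the theorem is complete. ■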
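The scaling relation in the proof of Theorem 2 can be illustrated for a 1-D feature with a bias term. The sketch below is an assumption-laden example, not the authors' procedure: the helper `least_squares_affine` and the six data points are hypothetical, and the noisy minimizer is computed from the expected noisy loss rather than from sampled noisy labels.

```python
def least_squares_affine(xs, targets):
    """Minimize the mean of (w*x + b - t)^2 over (w, b) in closed form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(targets) / n
    cov_xt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, targets)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    w = cov_xt / var_x
    b = mean_t - w * mean_x
    return w, b

xs = [-2.0, -1.0, -0.5, 0.5, 1.5, 3.0]    # hypothetical 1-D features
ys = [-1, -1, -1, 1, 1, 1]                # clean labels
eta = 0.3                                 # uniform noise rate

w_clean, b_clean = least_squares_affine(xs, ys)

# Expected noisy loss at x: (1-eta)(f-y)^2 + eta(f+y)^2 = f^2 - 2(1-2*eta)*y*f + 1,
# so the noisy risk has the same minimizer as least squares on targets (1-2*eta)*y.
w_noisy, b_noisy = least_squares_affine(xs, [(1 - 2 * eta) * y for y in ys])

print(w_noisy / w_clean)   # close to 1 - 2*eta
```

Since the noisy solution is a positive multiple of the clean one, both assign the same sign to every point.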
Corollary 1: Fisher Linear Discriminant (FLD) is noise 
tolerant under uniform label noise. 

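Proof: For binary classification, FLD [2] finds the direction
$\mathbf{w}^*$ as $\mathbf{w}^* = \arg\max_{\mathbf{w}} \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}$, which is proportional to
$S_W^{-1}(\mu_2 - \mu_1)$. Here $S_B = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T$ and $S_W =
\sum_{i=1}^{2} \sum_{\mathbf{x}_n \in C_i} (\mathbf{x}_n - \mu_i)(\mathbf{x}_n - \mu_i)^T$. $C_1$, $C_2$ represent the two
classes and $\mu_1$, $\mu_2$ denote the corresponding means. FLD can be
obtained as the risk minimizer under square loss function [2,
chap. 4] by choosing the target values as: $t_i = \frac{N}{N_1}$, $\forall \mathbf{x}_i \in C_1$
and $t_i = -\frac{N}{N_2}$, $\forall \mathbf{x}_i \in C_2$, where $N_1 = |C_1|$, $N_2 = |C_2|$ and
$N = N_1 + N_2$.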

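When the training set is corrupted with uniform label noise,
let $C_1^{\eta}$ and $C_2^{\eta}$ be the two sets now and $\mu_1^{\eta}$ and $\mu_2^{\eta}$ the corresponding
means. Let $N_1^{\eta} = |C_1^{\eta}|$ and $N_2^{\eta} = |C_2^{\eta}|$. New target values
are: $t_i = \frac{N}{N_1^{\eta}}$, $\forall \mathbf{x}_i \in C_1^{\eta}$ and $t_i = -\frac{N}{N_2^{\eta}}$, $\forall \mathbf{x}_i \in C_2^{\eta}$. The empirical
risk in this case is $E^{\eta}(\mathbf{w}, b) = \frac{1}{2} \sum_{i=1}^{N} (\mathbf{w}^T \mathbf{x}_i + b - t_i)^2$.
Equating the derivative of $E^{\eta}$ with respect to $b$ to zero, we
get $b = -\mathbf{w}^T \mu$, where $\mu = \frac{1}{N}(N_1 \mu_1 + N_2 \mu_2)$ is the mean
of the training set. Setting the gradient of $E^{\eta}$ with respect to $\mathbf{w}$
to zero and using the values of $b$ and $\mu$, we get

$$\left[\sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T - N \mu \mu^T\right] \mathbf{w} = \left(S_W + \frac{N_1 N_2}{N} S_B\right) \mathbf{w} = N(\mu_1^{\eta} - \mu_2^{\eta}) = N(1 - 2\eta)(\mu_1 - \mu_2)$$

where we have used the fact that $\mu_1^{\eta} = (1 - \eta)\mu_1 + \eta\mu_2$ and
$\mu_2^{\eta} = (1 - \eta)\mu_2 + \eta\mu_1$. Note that $S_B \mathbf{w} \propto (\mu_2 - \mu_1)$ for any
$\mathbf{w}$. Thus we see that $\mathbf{w}^*_{\eta} \propto S_W^{-1}(\mu_2 - \mu_1)$. Thus FLD is noise
tolerant under uniform label noise. ■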
Remark 2: What we have shown is that risk minimization 
under square loss function is tolerant to uniform noise if we 
are learning linear classifiers. We can, in general, nonlinearly 
map feature vectors to a higher dimensional space so that the 
training set becomes linearly separable [10]. Since uniform
label noise in the original feature space should become uni- 
form label noise in the transformed feature space, we feel that 
Theorem 2 should be true for risk minimization under square 
loss for any family of classifiers. 

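Now consider the non-uniform noise case where $\eta_{\mathbf{x}}$ is not the
same for all $\mathbf{x}$. Then, the risk $R^{\eta}$ is minimized by $\tilde{\mathbf{w}}^*_{\eta} =
[E_{\mathbf{x}}[\tilde{\mathbf{x}} \tilde{\mathbf{x}}^T]]^{-1} E_{\mathbf{x}}[(1 - 2\eta_{\mathbf{x}}) \tilde{\mathbf{x}} y_{\mathbf{x}}]$. Here, the $\eta_{\mathbf{x}}$ term can no longer
be taken out of the expectation. Hence, we may not get noise
tolerance. We show that it is so by a counter example as below.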




Fig. 2. Data for Example 2. $f^*$ is the classifier learnt when there is no
noise. $f^*_{\eta}$ is the classifier learnt in presence of non-uniform label noise.



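Example 2: Consider the unit circle centered at origin in
$\mathbb{R}^2$ and data points placed on its circumference as $\mathbf{x}_i =
[\cos\theta_i\ \sin\theta_i]^T$, $\theta_i = \frac{(2i-1)\pi}{36}$, $i = 1, \ldots, 36$. $y_{\mathbf{x}_i} =
1$, $i = 1, \ldots, 18$ and $y_{\mathbf{x}_{i+18}} = -1$, $i = 1, \ldots, 18$. Assume that
the probability distribution on the feature space is uniformly
concentrated on the training dataset. Let the set of classifiers
contain only linear classifiers passing through the origin.
(a) Without Noise: In this case, risk is minimized by $\mathbf{w}^* =
[0\ 1.27]^T$. The classifier $\mathrm{sign}(\mathbf{x}^T \mathbf{w}^*)$ linearly separates the two
classes. Thus $P[\mathrm{sign}(\mathbf{x}^T \mathbf{w}^*) = y_{\mathbf{x}}] = 1$.
(b) With Noise: Now let us introduce non-uniform label noise
as follows. $\eta_{\mathbf{x}_i} = 0.4$, $\mathbf{x}_i \in R_1 \cup R_2$, where $R_1 =
\{\mathbf{x}_2, \ldots, \mathbf{x}_7\}$ and $R_2 = \{\mathbf{x}_{20}, \ldots, \mathbf{x}_{25}\}$; $\eta_{\mathbf{x}_i} = 0$ for the rest
of the points. In this case, risk is minimized by $\mathbf{w}^*_{\eta} =
[-0.342\ 0.988]^T$. $\mathrm{sign}(\mathbf{x}^T \mathbf{w}^*_{\eta})$ mis-classifies $\mathbf{x}_1$, $\mathbf{x}_2$, $\mathbf{x}_{19}$ and
$\mathbf{x}_{20}$ as shown in Fig. 2. Hence $P[\mathrm{sign}(\mathbf{x}^T \mathbf{w}^*_{\eta}) = y_{\mathbf{x}}] = \frac{8}{9} \neq 1$.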

Thus square loss is not noise tolerant under non-uniform 
noise even if the risk minimizer under noise-free case achieves 
zero error and the optimal classifier is linear in parameters. 
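A small Python sketch of the setup in Example 2 is given below. It rests on assumptions that should be read with care: the angles are taken to be $\theta_i = (2i-1)\pi/36$, the helper names are hypothetical, and the minimizer is computed from the closed form $[E[\mathbf{x}\mathbf{x}^T]]^{-1} E[(1 - 2\eta_{\mathbf{x}})\mathbf{x} y_{\mathbf{x}}]$ for classifiers through the origin. The sketch checks only the resulting accuracies; the numeric weights it produces depend on the assumed angles and need not reproduce the quoted vectors exactly.

```python
import math

def solve_2x2(a11, a12, a21, a22, r1, r2):
    """Solve a 2x2 linear system with Cramer's rule."""
    det = a11 * a22 - a12 * a21
    return ((r1 * a22 - a12 * r2) / det, (a11 * r2 - a21 * r1) / det)

def square_loss_minimizer(points, labels, etas):
    """w = E[x x^T]^{-1} E[(1 - 2*eta_x) x y_x], uniform distribution on points."""
    n = len(points)
    a11 = sum(x1 * x1 for x1, _ in points) / n
    a12 = sum(x1 * x2 for x1, x2 in points) / n
    a22 = sum(x2 * x2 for _, x2 in points) / n
    r1 = sum((1 - 2 * e) * x1 * y for (x1, _), y, e in zip(points, labels, etas)) / n
    r2 = sum((1 - 2 * e) * x2 * y for (_, x2), y, e in zip(points, labels, etas)) / n
    return solve_2x2(a11, a12, a12, a22, r1, r2)

def accuracy(w, points, labels):
    """Fraction of points whose clean label equals sign(x^T w)."""
    hits = sum((1 if w[0] * x1 + w[1] * x2 > 0 else -1) == y
               for (x1, x2), y in zip(points, labels))
    return hits / len(points)

# ASSUMPTION: 36 points at angles (2i - 1) * pi / 36, i = 1..36.
thetas = [(2 * i - 1) * math.pi / 36 for i in range(1, 37)]
points = [(math.cos(t), math.sin(t)) for t in thetas]
labels = [1] * 18 + [-1] * 18
no_noise = [0.0] * 36
non_uniform = [0.4 if (2 <= i <= 7 or 20 <= i <= 25) else 0.0 for i in range(1, 37)]

w_clean = square_loss_minimizer(points, labels, no_noise)
w_noisy = square_loss_minimizer(points, labels, non_uniform)
print(accuracy(w_clean, points, labels), accuracy(w_noisy, points, labels))
```

Under these assumptions the noise-free minimizer separates the classes, while the minimizer under the non-uniform noise does not.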

C. Exponential Loss Function 

Exponential loss function is given by

$L_{\exp}(f(\mathbf{x}), y_{\mathbf{x}}) = \exp(-y_{\mathbf{x}} f(\mathbf{x}))$



This is the effective loss function for AdaBoost. Exponential
loss function does not have the noise tolerance property even 
under uniform label noise. We show this using the following 
counter example. 

Example 3: Let $\{(x_1, y_{x_1}), (x_2, y_{x_2}), (x_3, y_{x_3})\}$ be the
training dataset such that $x_1 = 5$, $x_2 = 10$ and $x_3 = 11$,
with $y_{x_1} = -1$, $y_{x_2} = -1$ and $y_{x_3} = +1$. Let the probability
distribution on the feature space be uniformly concentrated on
the training dataset. Here, we find a linear classifier which
minimizes the risk under exponential loss function. A linear
classifier on the real line can be expressed as $\mathrm{sign}(x + b)$.

• Without Noise: The risk of a linear classifier without label
noise is written as:

$$R(b) = \frac{1}{3}\left[e^{5+b} + e^{10+b} + e^{-11-b}\right]$$

By equating the derivative of $R(b)$ to zero, we get

$$e^{5+b} + e^{10+b} - e^{-11-b} = 0 \;\Rightarrow\; b^* = \frac{1}{2} \ln\left(\frac{e^{-11}}{e^{5} + e^{10}}\right) = -10.5034$$

$\mathrm{sign}(f(x)) = \mathrm{sign}(x + b^*)$ correctly classifies all the
points. Thus $P[\mathrm{sign}(x + b^*) = y_x] = 1$.
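• With Noise: Now let us introduce uniform label noise
with noise rate $\eta = 0.3$. The risk will be

$$R^{\eta}(b) = \frac{(1-\eta)}{3}\left[e^{5+b} + e^{10+b} + e^{-11-b}\right] + \frac{\eta}{3}\left[e^{-5-b} + e^{-10-b} + e^{11+b}\right]$$

Again equating the derivative of $R^{\eta}(b)$ to zero, we get

$$(1-\eta)\left(e^{5+b} + e^{10+b} - e^{-11-b}\right) + \eta\left(-e^{-5-b} - e^{-10-b} + e^{11+b}\right) = 0$$

$$b^*_{\eta} = \frac{1}{2} \ln\left(\frac{0.7\, e^{-11} + 0.3\,(e^{-5} + e^{-10})}{0.7\,(e^{5} + e^{10}) + 0.3\, e^{11}}\right) = -8.3052$$

$\mathrm{sign}(f(x)) = \mathrm{sign}(x + b^*_{\eta})$ mis-classifies $x_2$. Thus
$P[\mathrm{sign}(x + b^*_{\eta}) = y_x] = \frac{2}{3} \neq P[\mathrm{sign}(x + b^*) = y_x]$.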

Thus risk minimization under exponential loss is not noise
tolerant even with uniform noise.
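The two offsets in Example 3 follow from a closed form, which the Python sketch below evaluates. The function name `exp_risk_minimizer` is hypothetical; the derivation it encodes is the one in the example, namely that the expected risk has the form $e^{b} P + e^{-b} Q$ and is minimized at $b = \frac{1}{2}\ln(Q/P)$.

```python
import math

def exp_risk_minimizer(xs, ys, eta):
    """Offset b minimizing the expected exponential risk of sign(x + b)
    when every label is flipped independently with probability eta."""
    # Terms proportional to e^{b}: clean label -1 (weight 1-eta) or flipped +1 (weight eta).
    grow = sum(((1 - eta) if y == -1 else eta) * math.exp(x) for x, y in zip(xs, ys))
    # Terms proportional to e^{-b}: clean label +1 (weight 1-eta) or flipped -1 (weight eta).
    decay = sum(((1 - eta) if y == 1 else eta) * math.exp(-x) for x, y in zip(xs, ys))
    return 0.5 * math.log(decay / grow)

xs = [5.0, 10.0, 11.0]
ys = [-1, -1, 1]

b_clean = exp_risk_minimizer(xs, ys, 0.0)
b_noisy = exp_risk_minimizer(xs, ys, 0.3)
print(round(b_clean, 4), round(b_noisy, 4))

# Predicted labels under the noisy minimizer; the second point changes sign.
print([1 if x + b_noisy > 0 else -1 for x in xs])
```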

D. Log Loss Function

Log loss function is given by, 

$L_{\log}(f(\mathbf{x}), y_{\mathbf{x}}) = \ln(1 + \exp(-y_{\mathbf{x}} f(\mathbf{x})))$

This is the effective loss function for logistic regression. Risk 
minimization with log loss function also is not noise tolerant. 
We demonstrate it using following counter example. 

Example 4: Consider the same training dataset as in
Example 3. We find a linear classifier, $\mathrm{sign}(x + b)$, which
minimizes the risk under log loss function.

• Without Noise: The risk of a linear classifier without
label noise is

$$R(b) = \frac{1}{3}\left[\ln(1 + e^{5+b}) + \ln(1 + e^{10+b}) + \ln(1 + e^{-11-b})\right]$$

Equating the derivative of $R(b)$ to zero, we get

$$\frac{e^{5+b}}{1 + e^{5+b}} + \frac{e^{10+b}}{1 + e^{10+b}} - \frac{e^{-11-b}}{1 + e^{-11-b}} = 0 \;\Rightarrow\; 2e^{26} t^3 + (e^{15} + e^{16} + e^{21})\, t^2 - 1 = 0,$$

where $t = e^b$. Roots of this polynomial are
$-0.0034$, $-2.75 \times 10^{-5}$ and $2.73 \times 10^{-5}$. The only
positive root is $t = 2.73 \times 10^{-5}$. Using this value of $t$,
we get $b^* = \ln(t) = -10.5086$. $f(x) = x + b^*$ classifies
all the points correctly. Thus $P[\mathrm{sign}(x + b^*) = y_x] = 1$.
• With Noise: Now let us introduce uniform label noise
with noise rate $\eta = 0.3$. The risk will be

$$R^{\eta}(b) = \frac{(1-\eta)}{3}\left[\ln(1 + e^{5+b}) + \ln(1 + e^{10+b}) + \ln(1 + e^{-11-b})\right] + \frac{\eta}{3}\left[\ln(1 + e^{-5-b}) + \ln(1 + e^{-10-b}) + \ln(1 + e^{11+b})\right]$$

Equating the derivative of $R^{\eta}(b)$ to zero, we get a sixth
degree polynomial in $t = e^b$ which has only one positive
root. This root gives us the value of $b^*_{\eta} = \ln(t) =
-9.8607$. The classifier $\mathrm{sign}(f(x)) = \mathrm{sign}(x + b^*_{\eta})$ mis-classifies
$x_2$, which means $P[\mathrm{sign}(x + b^*_{\eta}) = y_x] = \frac{2}{3}$.
Thus, $P[\mathrm{sign}(x + b^*_{\eta}) = y_x] \neq P[\mathrm{sign}(x + b^*) = y_x]$ and log
loss is not noise tolerant even with uniform noise.
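The offsets in Example 4 can also be obtained without expanding the polynomial. The sketch below is one possible numerical route, not the method used in the example: it uses the fact that the derivative of the expected log-loss risk with respect to $b$ is $\sum_n [\sigma(x_n + b) - q_n]$, where $\sigma$ is the logistic function and $q_n$ is the probability that the observed label of $x_n$ is $+1$, and finds its root by bisection. The function names and the search interval are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_risk_minimizer(xs, ys, eta, lo=-50.0, hi=50.0, iters=200):
    """Offset b minimizing the expected log-loss risk of sign(x + b)
    under uniform label noise with rate eta (bisection on the derivative)."""
    # q[n] = probability that the observed label of x_n is +1.
    q = [(1 - eta) if y == 1 else eta for y in ys]

    def grad(b):
        # Derivative of the (unnormalized) expected risk; increasing in b.
        return sum(sigmoid(x + b) - qn for x, qn in zip(xs, q))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if grad(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

xs = [5.0, 10.0, 11.0]
ys = [-1, -1, 1]

b_clean = log_risk_minimizer(xs, ys, 0.0)
b_noisy = log_risk_minimizer(xs, ys, 0.3)
print(round(b_clean, 4), round(b_noisy, 4))
```

Bisection is applicable here because the risk is convex in $b$, so its derivative is monotone.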



E. Hinge Loss Function 

This is a convex loss function and has the following form. 

$L_{\mathrm{hinge}}(f(\mathbf{x}), y_{\mathbf{x}}) = \max(0, 1 - y_{\mathbf{x}} f(\mathbf{x}))$

Support vector machine is based on minimizing risk under the 
hinge loss. Here we show that hinge loss function is not noise 
tolerant using a counter example. 

Example 5: Consider the same training dataset as in Example 3.
Since, with hinge loss function, we cannot arbitrarily
scale the coefficient of $x$ to 1, we represent the linear classifier
as $\mathrm{sign}(wx + b)$.

• Without Noise: The risk of a linear classifier with noise-
free training data is

$$R(w, b) = \frac{1}{3} \sum_{n=1}^{3} \max[0, 1 - y_{x_n}(w x_n + b)]$$

To find the minimizer of $R(w, b)$, we need to solve

$$\min_{w, b, \xi_1, \xi_2, \xi_3} \; \frac{1}{3} \sum_{n=1}^{3} \xi_n$$
s.t. $5w + b \leq -1 + \xi_1$, $\xi_1 \geq 0$
     $10w + b \leq -1 + \xi_2$, $\xi_2 \geq 0$
     $11w + b \geq 1 - \xi_3$, $\xi_3 \geq 0$

The optimal solution of the above linear program is
$(w^*, b^*) = (54.7738, -571.221)$ which is also the minimizer
of $R(w, b)$. $\mathrm{sign}(w^* x + b^*)$ classifies all the points
correctly. Thus $P[\mathrm{sign}(w^* x + b^*) = y_x] = 1$.
• With Noise: Now we introduce uniform label noise with
noise rate $\eta = 0.3$ in the training data. The risk of a
linear classifier in presence of uniform label noise is

$$R^{\eta}(w, b) = \frac{1}{3} \sum_{n=1}^{3} \Big[(1 - \eta) \max[0, 1 - y_{x_n}(w x_n + b)] + \eta \max[0, 1 + y_{x_n}(w x_n + b)]\Big]$$

Minimizing $R^{\eta}(w, b)$ by solving the equivalent linear
program as earlier, we get $(w^*_{\eta}, b^*_{\eta}) = (0.3333, -2.6667)$
which is also the minimizer of $R^{\eta}(w, b)$. The classifier
$\mathrm{sign}(w^*_{\eta} x + b^*_{\eta})$ mis-classifies $x_2$. Thus $P[\mathrm{sign}(w^*_{\eta} x +
b^*_{\eta}) = y_x] = \frac{2}{3} \neq P[\mathrm{sign}(w^* x + b^*) = y_x]$.

Thus hinge loss is not noise tolerant even under uniform noise,
even when the optimal classifier is linear.
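The noisy minimizer in Example 5 can be cross-checked without a linear programming solver. The sketch below swaps in a brute-force alternative: because the expected hinge risk is convex and piecewise linear in $(w, b)$, and is bounded below with a bounded set of minimizers when $0 < \eta < 0.5$, a minimizer can be found among the points where two of the kinks $w x_n + b = \pm 1$ intersect. The function names are hypothetical, and this enumeration is an illustration rather than the linear program described above.

```python
from itertools import combinations, product

def noisy_hinge_risk(w, b, xs, ys, eta):
    """Expected hinge risk of f(x) = w*x + b under uniform label noise eta."""
    total = 0.0
    for x, y in zip(xs, ys):
        m = y * (w * x + b)                 # margin with respect to the clean label
        total += (1 - eta) * max(0.0, 1 - m) + eta * max(0.0, 1 + m)
    return total / len(xs)

def minimize_over_kinks(xs, ys, eta):
    """Evaluate the risk wherever two kinks w*x_n + b = +/-1 meet; return the best."""
    best = None
    for xi, xj in combinations(xs, 2):
        for si, sj in product((-1.0, 1.0), repeat=2):
            w = (sj - si) / (xj - xi)       # solve w*xi + b = si, w*xj + b = sj
            b = si - w * xi
            r = noisy_hinge_risk(w, b, xs, ys, eta)
            if best is None or r < best[0] - 1e-12:
                best = (r, w, b)
    return best

xs = [5.0, 10.0, 11.0]
ys = [-1, -1, 1]

risk, w_noisy, b_noisy = minimize_over_kinks(xs, ys, 0.3)
print(round(w_noisy, 4), round(b_noisy, 4), round(risk, 4))
```

With only three training points there are twelve candidate intersections, so the enumeration is cheap; for larger problems the linear program is the practical choice.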



IV. Some Empirical Results 

In this section, we present some empirical evidence for 
our theoretical results. The main difficulty in doing such 
simulations is that there is no general purpose algorithm for 
risk minimization under 0-1 loss. Here we use the CALA-team 
algorithm proposed in [11] which (under sufficiently small
learning step-size) converges to the minimizer of risk under 0-1
loss in case of linear classifiers. Hence, here we restrict the
simulations only to learning of linear classifiers and hence give 
experimental results on Iris recognition dataset. 

Iris recognition is a three class classification problem in 4- 
dimensions. The first class, Iris-setosa, is linearly separable 
from the other two classes, namely, Iris-versicolor and Iris- 
virginica. We consider a linearly separable 2-class problem by 
combining the latter two classes as one class. 

The original Iris data set has no label noise. We introduce 
different rates of uniform noise varying from 10% to 30%. We 
incorporated non-uniform label noise as follows. For every 
example, the probability of flipping the label is based on 
which quadrant (with respect to the first two features) the 
example falls in. The noise rate in this case is represented 
by a quadruple with i-th element representing probability of 
wrong class label if the feature vector is in i th quadrant 
(i= 1,2,3,4). 
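The quadrant-based noise described above might be implemented along the following lines. This is a sketch under assumptions: the text does not state the point about which quadrants are defined, so the `center` argument is hypothetical (for instance, the mean of the first two features), as are the function names and the toy feature vectors.

```python
import random

def quadrant(x1, x2, center=(0.0, 0.0)):
    """Quadrant index 1..4 (counter-clockwise) of (x1, x2) relative to center."""
    dx, dy = x1 - center[0], x2 - center[1]
    if dx >= 0 and dy >= 0:
        return 1
    if dx < 0 and dy >= 0:
        return 2
    if dx < 0 and dy < 0:
        return 3
    return 4

def flip_by_quadrant(features, labels, rates, center, rng):
    """Flip each label with the rate of the quadrant given by the first two features."""
    noisy = []
    for x, y in zip(features, labels):
        eta = rates[quadrant(x[0], x[1], center) - 1]
        noisy.append(-y if rng.random() < eta else y)
    return noisy

# Toy 3-D feature vectors (hypothetical); only the first two features pick the quadrant.
features = [(1.0, 1.0, 0.2), (-1.0, 2.0, 0.1), (-3.0, -1.0, 0.5), (2.0, -2.0, 0.3)]
labels = [1, -1, 1, -1]
rates = (0.15, 0.20, 0.25, 0.30)          # one rate per quadrant, i = 1..4

noisy_labels = flip_by_quadrant(features, labels, rates, (0.0, 0.0), random.Random(0))
```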

The test examples are not corrupted with classification noise 
while the training samples are noisy. We use test error rate 
as an indicator of the noise-tolerance. We compare CALA 
algorithm for risk minimization under 0-1 loss with SVM 
(hinge loss), linear least squares (square loss), and logistic
regression (log loss), which are risk minimization algorithms
under different convex loss functions.

TABLE I
Simulation results with Iris data (test accuracy in %, mean ± standard deviation)

Noise Rate                | 0-1 loss (CALA) | hinge loss (SVM, C=10^3) | square loss (Least Sq.) | log loss (LogReg)
No Noise                  | 97.53±0.38      | 98.67                    | 92.67                   | 98.67
Uniform 10%               | 97.13±0.89      | 93.60±4.25               | 91.20±2.63              | 92.40±2.97
Uniform 20%               | 97.00±1.34      | 87.53±2.35               | 89.33±2.86              | 89.40±3.19
Uniform 30%               | 96.00±2.43      | 82.26±8.46               | 84.60±5.16              | 84.33±4.83
Non-Uniform 15,20,25,30%  | 96.47±1.49      | 89.67±3.18               | 91.27±1.49              | 91.67±2.07
Non-Uniform 30,25,20,15%  | 97.00±1.01      | 82.47±7.04               | 85.80±5.07              | 85.93±5.09

The results are shown in Table I. For each noise rate, we
generated ten random noisy training data sets. We show the
mean and standard deviation of accuracy on test set with each
of the algorithms. (The CALA algorithm [11] is a stochastic
one and hence has a non-zero standard deviation even in the 
case of no-noise data). As can be seen from the table, risk 
minimization under 0-1 loss has impressive noise tolerance 
under both uniform and non-uniform label noise. Both SVM 
and logistic regression have the highest accuracy under no- 
noise; but their accuracy drops from 98% to 87% and 89% 
respectively under uniform noise rate of 20%. Linear least 
squares algorithm achieves accuracy of 92% when there is 
no noise and it drops to only 89% when 20% uniform noise 
is added, showing that it is tolerant to uniform noise. (The 
performance of Fisher linear discriminant is similar to that of 
linear least squares: it achieves accuracy of 94%, 91.07%±3.05,
89.13%±2.67, 84.27%±6.12 respectively on 0%, 10%, 20% 
and 30% uniform noise and 91.53%±1.72 and 87.67%±2.71 
on the two cases of non-uniform noise). Also, the standard 
deviations of both SVM and logistic regression are much 
larger, showing that performance of risk minimization under 
these loss functions is very sensitive to noise. 

V. Conclusion 

While learning a classifier, one has to often contend with 
noisy training data. In this paper, we presented some analysis 
to bring out the inherent noise tolerant properties of the risk 
minimization strategy under different loss functions. 

Of all the loss functions, the 0-1 loss function has best noise 
tolerant properties. We showed that it is noise tolerant under 
uniform noise and also under non-uniform noise if the risk 
minimizer achieves zero risk on uncorrupted or noise-free data. 

If we consider the case where we think of our ideal noise- 
free sample as the one obtained by classifying iid feature vec- 
tors using Bayes optimal classifier, the minimum risk achieved 
would be zero if the family of classifiers over which the risk is 
minimized includes the structure of Bayes classifier. In such 
a case, the noise-tolerance (under non-uniform label noise) 
of risk minimization implies that if we find the classifier to 
minimize risk under 0-1 loss function (treating the labels given 
in our training data as correct) we would (in a probabilistic 
sense) automatically learn the Bayes optimal classifier. This is
an interesting result that makes risk minimization under 0-1
loss a very attractive classifier learning strategy. 

A problem with minimizing risk under 0-1 loss function is 
that it is difficult to use any standard optimization technique 
to minimize risk due to discontinuity of loss function. Hence, 
given the noise-tolerance properties presented here, an interest- 
ing problem to address is that of some gradient-free optimiza- 
tion techniques to minimize risk under 0-1 loss function. For 
the linear classifier case, the stochastic optimization algorithm 
proposed in [11] (which is what we used in simulations) is one
such algorithm. To really exploit the noise-tolerant property 
of the 0-1 loss function we need such optimization techniques 
which work for nonlinear classifiers also. 

On the other hand, risk under convex loss functions is easy 
to optimize. Many generic classifiers are based on minimizing 
risk under these convex loss functions. But it is observed in
practice that in presence of noise, these approaches over-fit. 

In this paper, we showed that these convex loss functions 
are not noise tolerant. Risk minimization under hinge loss, 
exponential loss and log loss is not noise tolerant even under 
uniform label noise. This explains the problem one faces 
with algorithms such as SVM if the class labels given are 
sometimes incorrect. We also showed that the linear least 
squares approach is noise tolerant under uniform noise but 
not under non-uniform noise. Same is shown to be true of 
Fisher linear discriminant also. 

Most algorithms for learning classifiers concentrate on 
minimizing risk under a convex loss function to make the 
optimization problem more tractable. The analysis presented 
in this paper suggests that looking for optimization techniques 
to minimize risk under 0-1 loss function may be a promising 
approach for classifier design especially when we have to learn 
from noisy training data.